Abstract

This paper discusses the notion of generalization of training samples over
long distances in the input space of a feedforward neural network. Such a
generalization might occur in various ways that differ in how great the
contribution of different training features should be.

The structure of a neuron in a feedforward neural network is analyzed, and
it is concluded that the actual performance of the discussed generalization
in such neural networks may be problematic: while such neural networks
might be capable of such a distant generalization, a random and spurious
generalization may occur as well.

To illustrate the differences in how different learning machines generalize
the same function, results given by support vector machines are
also presented.

keywords: supervised learning, generalization, feedforward neu- 
ral network, support vector machine 

1 Introduction 



Generalization is one of the basic notions in machine learning. Yet, in the exist- 
ing literature, usually only the indicators of generalization quality like the mean 
square error over the test samples are presented, without a more detailed study of 
the characteristics of the generalization functions produced by different learning 
machines. 

In this paper, a special kind of generalization is analyzed, using the example of
classic feedforward neural networks with linear weight functions. In the discussed
generalization type, generalized samples exist which are distant from any training
samples. The distance between two samples is defined as the distance d between the
independent variables of the samples, in the input space of a feedforward learning
machine L. For example, let the sample $S_i$ be $(x_1^i, x_2^i, y^i)$, where the independent
variables are $x_1^i$ and $x_2^i$, and the dependent variable is $y^i$. Then, the discussed
distance d between two samples $S_p$ and $S_g$ might be defined as the Euclidean distance
between the points in the input space of L whose coordinates are the independent
variables $(x_1^p, x_2^p)$ and $(x_1^g, x_2^g)$. If a generalized sample $S_g$ is distant from any
training samples, it means that there are different groups of training samples that
might be expected to compete in generalizing $S_g$.

Figure 1: An example of a close sample C and a distant sample D in an input
space of a feedforward learning machine. 

Let us discuss examples of distant and, conversely, close samples. Fig. 1
illustrates an input space of a feedforward learning machine. Let the learning
machine have two inputs $x_1$ and $x_2$. Let there be some samples in the space,
whose independent variables $(x_1^i, x_2^i)$ determine their respective positions in the input
space, and which have a dependent variable $y^i$. Let the training samples have
values of $y^i$ equal to either 0 or 1, and let us call these samples '0' or '1' samples,
respectively. Let there also be two generalized samples absent from the training set,
whose dependent variables are unknown, and thus their $y^i$ values are denoted by a
and b. The sample with $y^i = a$, let us call it C, can be regarded as a close one: it
is near only to a cluster of '1' samples, and it is likely that the user of the learning
machine expects the dependent variable of the sample to be estimated as
a value close to 1. Let the sample with $y^i = b$ be called D. At least three
obvious ways of generalizing D can be thought of:

• In the surroundings of D, there are some '0' samples and some '1' samples in
an approximate balance; thus, the dependent variable of D should be equal
to about 0.5.

• All '1' samples together create a single horizontal stripe-shaped feature, and
D is inside the feature. Additionally, the '0' samples create two horizontal stripe-shaped
features, and D is outside each one. Thus, the dependent variable of D should
be equal to about 1.

• The closest training sample to D is a '0' sample, so the dependent variable of D should
be equal to about 0.

Thus, groups of samples of different types were discerned around D that can
compete in generalizing D. The sample D is thus regarded as a distant sample.
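
The three competing estimates for D can be made concrete with a small sketch. The sample layout, the radius and the stripe width below are hypothetical values chosen only for illustration; they are not taken from Fig. 1. The helper names are likewise assumptions.

```python
import math

def distance(p, q):
    """Euclidean distance between the independent variables of two samples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def neighbourhood_estimate(train, x, radius):
    """First hypothesis: average the dependent variables found within a radius of x."""
    near = [s[2] for s in train if distance(s[:2], x) <= radius]
    return sum(near) / len(near) if near else None

def stripe_estimate(train, x, width):
    """Second hypothesis: average over the horizontal stripe that contains x."""
    inside = [s[2] for s in train if abs(s[1] - x[1]) <= width]
    return sum(inside) / len(inside) if inside else None

def nearest_estimate(train, x):
    """Third hypothesis: copy the dependent variable of the closest training sample."""
    return min(train, key=lambda s: distance(s[:2], x))[2]

# Hypothetical layout: a stripe of '1' samples between two stripes of '0' samples,
# with a gap around x1 = 0 that contains no stripe samples.
train = [(x1, x2, y)
         for x1 in (-3.0, -2.0, 2.0, 3.0)
         for x2, y in ((1.5, 0), (0.0, 1), (-1.5, 0))]
train.append((0.0, 1.0, 0))      # a lone '0' sample that happens to be closest to D

D = (0.0, 0.0)
print(neighbourhood_estimate(train, D, radius=3.2))   # roughly balanced neighbourhood
print(stripe_estimate(train, D, width=0.5))           # D lies inside the '1' stripe
print(nearest_estimate(train, D))                     # closest sample is a '0'
```

Each function encodes one plausible hypothesis, and on the same distant sample they disagree, which is the competition described above.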



It will be shown that such alternate ways of generalization, in the case of
feedforward neural networks, may sometimes produce a random and spurious
generalization. That is, the problem of long-distance generalization may sometimes
be solved well by the neural network, but in other cases the network may
give quite unexpected results: artifacts that reveal the internal structure of
the learning machine rather than a likely estimation hypothesis.

The performance of support vector machines will be presented as well, to show 
the generalization differences that exist between different types of learning ma- 
chines. 

2 Distant generalization in feedforward neural 
networks 

In a feedforward neural network (FNN), the combination function in a neuron of
the McCulloch type [5] is a linear combination of the input values of the neuron.
To obtain the output value of the neuron, the value of the combination function is
non-linearly transformed, typically using a sigmoidal or hyperbolic tangent activation
function. It means that the neuron acts identically for all arguments that lie on the same
hyperplane in the space of the domain of the neuron. For example, there is a
hyperplane $P_i$ for which the output value of the neuron is constant and equal to
i. The partial derivatives of the neuron function with respect to each of the inputs of the
neuron are constant on $P_i$ as well. It might thus be said that a trained neuron
transfers the properties of some samples, which it learned during the training process,
over infinitely large regions in the input space of the neuron, because hyperplanes
are infinite. The infinity of the transfer might make FNNs good for distant
generalizations, as will be further shown in tests. On the other hand, though, the
infinite transfer may sometimes produce wrong results, because a training sample
$S_t$ may influence the generalization of some sample $S_g$ even if these samples are
very distant from each other. But, intuitively, samples that are very far from each
other might have nothing in common.
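
The infinite transfer can be checked numerically with a minimal sketch of a single neuron. The weights, the bias and the starting point are hypothetical values chosen for illustration; only the linear combination function and the hyperbolic tangent activation come from the text.

```python
import math

def neuron(w, b, x):
    """A neuron with a linear combination function and a tanh activation."""
    return math.tanh(w[0] * x[0] + w[1] * x[1] + b)

# Hypothetical weights and bias, chosen only for illustration.
w, b = (3.0, 4.0), -1.0

# A point on the hyperplane of zero output: 3 * 0 + 4 * 0.25 - 1 = 0.
x0 = (0.0, 0.25)
# A direction parallel to the hyperplane, i.e. orthogonal to w.
u = (-w[1], w[0])

# Moving along the hyperplane, even very far, leaves the output unchanged.
for t in (0.0, 1.0, 1e3, 1e6):
    x = (x0[0] + t * u[0], x0[1] + t * u[1])
    print(t, neuron(w, b, x))
```

The output stays at zero for every step length, so whatever the neuron learned near x0 is carried unchanged to points arbitrarily far away along the hyperplane.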

3 Tests 

Let us discuss a real process of training an FNN with two kinds of data: the first
one, $\theta_l$, deliberately constructed to simplify the distant generalization, and the
second one, $\theta_c$, constructed to make the generalization complex to solve by the
FNN. The two three-dimensional sets are illustrated in Fig. 2(a) and Fig. 2(b),
respectively. The sets are 64 x 64 images. Let the coordinates of the pixels be
the two independent variables, and the brightnesses of the pixels be the dependent
variable.

Let the pixel at the lower left corner have the coordinates (-0.5, -0.5) and let
the pixel at the upper right corner have the coordinates (0.5, 0.5). Let the brightness
of the pixels represent the range from -0.5 for black to 0.5 for white.
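
The mapping from an image to samples might be sketched as follows. Two points are assumptions not stated in the text: that the pixel coordinates are spaced evenly between the two corners, and that the stored brightness is an integer in the range 0 to 255. The function names are hypothetical.

```python
SIZE = 64  # the sets are 64 x 64 images

def pixel_to_input(col, row_from_bottom, size=SIZE):
    """Map pixel indices to the two independent variables in [-0.5, 0.5].

    Assumes evenly spaced pixels, with index 0 at the lower left corner.
    """
    return (col / (size - 1) - 0.5, row_from_bottom / (size - 1) - 0.5)

def brightness_to_output(value, max_value=255):
    """Map a stored brightness (assumed 0..max_value) to the range [-0.5, 0.5]."""
    return value / max_value - 0.5

print(pixel_to_input(0, 0))          # lower left corner
print(pixel_to_input(63, 63))        # upper right corner
print(brightness_to_output(0), brightness_to_output(255))   # black, white
```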

Figure 2: The data sets (a) $\theta_l$, (b) $\theta_c$ and (c) the mask of the training subsets.

Let the feedforward layered densely connected networks with two inputs and
a single neuron in the output layer be used. Let the sizes of the FNNs be such
that they can comfortably fit both of the generalized sets: it was tested that
it is sufficient if each of the networks has two hidden layers of 16 neurons each.
Let the FNNs have classic hyperbolic tangent activation functions. Let there be 
a weight decay at a rate of 2 • 10~^ to improve generalization [1]. Let an online 
backpropagation training be used [B] with a fixed learning step of 0.02. 
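
The setup above might be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the tests: the initial weight range, the squared-error loss, the tanh output neuron and the application of weight decay to biases are all assumptions, and the `decay` argument is left at zero as a placeholder for the rate given in the text.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Weights of one dense layer; the last entry of each row is the bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def forward(net, x):
    """Return the activations of every layer, the input included."""
    acts = [list(x)]
    for layer in net:
        prev = acts[-1] + [1.0]                       # append the bias input
        acts.append([math.tanh(sum(w * a for w, a in zip(row, prev)))
                     for row in layer])
    return acts

def train_step(net, x, y, step=0.02, decay=0.0):
    """One online backpropagation update on a single sample (squared error)."""
    acts = forward(net, x)
    # Error signal at the output neuron, including the tanh derivative.
    delta = [(o - y) * (1.0 - o * o) for o in acts[-1]]
    for k in range(len(net) - 1, -1, -1):
        prev = acts[k] + [1.0]
        # Propagate the error backwards before this layer's weights change.
        back = [sum(delta[j] * net[k][j][i] for j in range(len(delta)))
                for i in range(len(acts[k]))]
        for j, row in enumerate(net[k]):
            for i in range(len(row)):
                row[i] -= step * (delta[j] * prev[i] + decay * row[i])
        if k > 0:
            delta = [back[i] * (1.0 - acts[k][i] ** 2) for i in range(len(back))]
    return 0.5 * (acts[-1][0] - y) ** 2               # loss before the update

rng = random.Random(0)
sizes = [2, 16, 16, 1]      # two inputs, two hidden layers of 16, one output
net = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
print(train_step(net, (0.1, -0.2), 0.3))
```

In online training, `train_step` would be called once per training sample in each iteration, rather than on an accumulated batch.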

The training subsets of both the set $\theta_l$ and the set $\theta_c$ are represented by the
image in Fig. 2(c): the black pixels in the image mean that the corresponding
pixels in Fig. 2(a) and Fig. 2(b) belong to the training subsets of the respective sets.
Thus, the white region in Fig. 2(c) is the one unknown during training. Because
the unknown region is relatively large in comparison to the sizes of the features in
the training sets, it can be said that the generalization to the region employs the
distant generalization.

Let four of these neural networks, $\mathcal{N}_i^l$, i = 0...3, be trained with the training
subset of $\theta_l$, and let the other four of these neural networks, $\mathcal{N}_i^c$, i = 0...3, be
trained with the training subset of $\theta_c$. During the training, the generalizing
functions of the networks and the weights of the neurons in the first hidden layer were
sampled at iterations 10000000, 31622777 and 100000000. The results
are illustrated in Fig. 3. In the figure, there is a two-row table for each of the
iterations at which the sampling was done. The sampled generalization functions are
placed in the upper row, and the diagrams representing the input spaces of neurons in
the first hidden layer are placed respectively in the lower row. The representation
of the generalization functions is analogous to that of the sets $\theta_l$ and $\theta_c$. Each of
the input space diagrams shows with translucent lines the zeroes of the outputs
of the first hidden layer neurons, that is, it shows the hyperplanes $P_0$ in the input
space of the tested FNNs. The lower left corner of the dotted rectangles drawn
within the diagrams represents the input values at (-0.5, -0.5) and the upper
right corner of the rectangles represents the input values at (0.5, 0.5).
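
For a first hidden layer neuron with two inputs, the zero of the output is the line where the linear combination vanishes, because tanh is zero only at zero. The sketch below computes the part of that line lying inside the rectangle from (-0.5, -0.5) to (0.5, 0.5). The function name is hypothetical, and clipping to the rectangle is a simplification; the diagrams may draw the lines further.

```python
def zero_line_in_square(w1, w2, b, lo=-0.5, hi=0.5):
    """Clip the line w1*x1 + w2*x2 + b = 0 to the square [lo, hi] x [lo, hi].

    Returns the two end points of the visible segment, or None if the line
    misses the square or only touches a corner.
    """
    points = []
    if w2 != 0.0:                         # intersections with the vertical edges
        for x1 in (lo, hi):
            x2 = -(w1 * x1 + b) / w2
            if lo <= x2 <= hi:
                points.append((x1, x2))
    if w1 != 0.0:                         # intersections with the horizontal edges
        for x2 in (lo, hi):
            x1 = -(w2 * x2 + b) / w1
            if lo <= x1 <= hi:
                points.append((x1, x2))
    unique = sorted(set(points))          # a corner hit is found on two edges
    return (unique[0], unique[-1]) if len(unique) >= 2 else None

print(zero_line_in_square(1.0, 0.0, 0.0))    # vertical line through the centre
print(zero_line_in_square(1.0, -1.0, 0.0))   # diagonal through two corners
print(zero_line_in_square(1.0, 1.0, 5.0))    # line far outside the square
```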

Let us divide the features in the training sets into the linear ones $f_l$, being the
three white lines, and the circular ones $f_c$, being the four white circles. It is visible
in Fig. 3 that in the case of $\mathcal{N}_i^l$ most hyperplanes concentrate near the linear
features $f_l$, and in the case of $\mathcal{N}_i^c$ generally some hyperplanes concentrate near
the linear features $f_l$ and some concentrate near the circular features $f_c$. In the
latter case, in effect, the hyperplanes concentrated near $f_c$ cross the hyperplanes
concentrated near $f_l$. Additionally, the crossings occur partially in the unknown
region, i.e. in the region marked in white in Fig. 2(c). These are exactly the
conditions showing the discussed notion of competing groups of samples. While in
the case of $\mathcal{N}_i^l$ the neurons transferred only the properties of $f_l$ over the unknown
region, in the case of $\mathcal{N}_i^c$ some neurons extend their hyperplanes onto the unknown
region from the region of $f_l$ and some others from the region of $f_c$. Thus, properties
of both $f_l$ and $f_c$ are transmitted to the unknown region.

Figure 3: The generalizing functions and diagrams of the zeroes of the first hidden
layer neurons. Columns: the networks $\mathcal{N}_0^l$ to $\mathcal{N}_3^l$ and $\mathcal{N}_0^c$ to $\mathcal{N}_3^c$; rows: iterations
10000000, 31622777 and 100000000.

The differences between $\mathcal{N}_i^l$ and $\mathcal{N}_i^c$ are clearly visible. $\mathcal{N}_i^l$ finely generalized $f_l$
over the unknown region, while $\mathcal{N}_i^c$ produced in the unknown region some features
that look like random artifacts. Thus, it might be said that the discussed distant
generalization was resolved in some cases in a fine way, and in some cases in a
rather spurious way by the tested FNNs. An example alternate solution without
the artifacts might be to generalize to the unknown region in the case of the set
$\theta_c$ in the same way as it happened in the tests in the case of the set $\theta_l$, that is,
to just generalize the features $f_l$ over the unknown region, because $f_l$, and not $f_c$,
directly neighbor the unknown region.

Let us compare the FNNs to SVMs [HE]. SVMs give very different results for 
both sets. Example results are illustrated in Fig. 4. The particular example used
$\nu$-SVC [7] trained using LIBSVM [2].

In the particular examples, SVMs solved the problem of distant generalization
in a different way than the tested FNNs in the case of both the set $\theta_l$ and the set
$\theta_c$. The SVMs were able to produce a generalization with minimal artifacts if their
learning coefficients allowed for a proper fitting to the training data, as seen in
Fig. 4(c) and (f). The SVMs have a large test error for both sets, though, as they
did not fuse $f_l$ into a single set of parallel bars.

Figure 4: Examples of generalization using $\nu$-SVC with the radial basis kernel
with p = 0.2, e = 0.001 and: for the binarized $\theta_l$ set with a threshold at 0.5 (a)
c = 0.3, $\gamma$ = 3, (b) c = 1, $\gamma$ = 10, (c) c = 3, $\gamma$ = 30; for the binarized $\theta_c$ set with a
threshold at 0.5 (d) c = 1, $\gamma$ = 10, (e) c = 3, $\gamma$ = 30, (f) c = 10, $\gamma$ = 100.

Thus, FNNs have a smaller test error for $\theta_l$, because they could fuse the features
$f_l$, and both FNNs and SVMs have a relatively large test error for $\theta_c$, but for
different reasons.

4 Conclusions 

The distant generalization may work quite differently for different training sets and 
for different learning machines. In particular, the resulting generalizing functions 
may contain artifacts, related to the internal structure of the learning machine. 

A study of these differences might give more clues for choosing a particular learning
machine for a particular task than a comparison of the test MSE alone would
give.

For example, the classic FNNs with linear combination functions and hyperbolic 
tangent activation functions may introduce substantial random artifacts to the 
generalizing functions. In some applications where the stability of the results is 
important, usage of such FNNs might thus be discouraged. But, conversely, the 
tested FNNs, thanks to the structure of neurons, can be capable of generalizing by 
extending and fusing together elongated features that exist in the training set. 